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ABSTRACT 

Motivation: The decision whether a measured distribution 
complies with an equidistribution is a central element of 
many biostatistical methods. High throughput differential 
expression measurements, for instance, necessitate to 
judge possible over-representation of genes. The reliabil- 
ity of this judgement, however, is strongly affected when 
rarely expressed genes are pooled. We propose a method 
that can be applied to frequency ranked distributions and 
that yields a simple but efficient criterion to assess the 
hypothesis of equiprobable expression levels. 
Results: By applying our technique to surrogate data we 
exemplify how the decision criterion can differentiate be- 
tween a true equidistribution and a triangular distribution. 
The distinction succeeds even for small sample sizes 
where standard tests of significance (e.g. x 2 ) fail- Our 
method will have a major impact on several problems of 
computational biology where rare events baffle a reliable 
assessment of frequency distributions. 
Availability: The program package is available upon 
request from the authors. 
Contact: thorsten.poeschel@charite.de 

INTRODUCTION 

Biostatistical analyses quite generally infer desired infor- 
mation from experimentally observed frequency distribu- 
tions. Microarrays, for instance, can screen thousands of 
genes simultaneously, thus allowing for high-throughput 
measurements ( Gershon, 2002 ). The expression level of 
genes is quantified via hybridisation signal intensities and, 
after an appropriate normalisation procedure, yields a vec- 
tor of numbers which can be understood as an empirically 
obtained frequency distribution. The important issues of 
coregulat ion (Kielbasa et al., 2001 ) and differential ex- 
pression ( Tsodikov et al., 2001 ) are based on further anal- 
ysis of these distributions. In this context, pooling a family 
of rarely expressed genes can be combined with the task 
to decide whether the observe d expression patter n com- 
plies with an equidistribution ( Tandle et al., 2001 ). As is 
intuitively clear, the small sample size makes this decision 
a hard problem - nobody would expect to reproduce the 



statistics of a fair die (with equal probability 1 /6 for each 
side) with only four trials. The p roblem of small sa mple 
statistics pervade s literature, e.g. (Storey et al., 199a ; Hel- 
land et al., 1998; Josefssonet al., 1998). Similar problems 



might occur in the stati stical analyses of codon distribu- 
tions ( |Som et al., 2001 ) or other biopolymers (Ramsden 
and Vohradsky, 1998). As a last field of potential applica- 
tion let us mention the computational comparison of tw o 
draft sequences of the human genome ( Aach et al., 2001 ). 

We propose a method to decide the above posed ques- 
tion of an equidistribution based on frequency ranked 
statistics. The technique yields a criterion that can detect 
frequency distributions generated from a true equidistri- 
bution and reject others. It is important to note that the 
criterion we have devised is rather efficient for small sam- 
ple sizes (expression levels) where standard measures of 
significance like the x 2 -test or the Kolmogorov-Smirnov 
test fail. We expect that biostatistical analyses of small 
sample data, as met e.g. for rarely expressed genes, can 
profit from our criterion. 

SYSTEM AND METHODS 

Assume, in an experimental sampling probe we find N* 
different species occurring with relative frequencies /i, 
fi ■ ■■ In*- The term species is not meant in its strict 
biological sense here but more general as a class of 
individuals. As an example we mention the subsequences 
of a certain length, e.g. I = 6, of the nuclein acids A, G, 
C, and T. A species is the word GATAGG which may 
be found in a gene at various positions, e.g., at positions 
45, 122, and 431. For this example we say that the species 
GATAGG occurs with 3 representatives. 

Quite often uniform probabilities are motivated by 
theoretical considerations or as the simplest assumption. 
We present a practical criterion based on finite sample 
size statistics that allows to decide, i.e. to accept or 
reject, the hypothesis of an underlying uniform probability 
distribution p 1 = p 2 = - - - = Pn = 1/N with N > N*. 

Given a probability distribution {p i5 i = 1, . . . , N} 

with ^2 =1 Pi = 1 where the pi represent the likelihood 
to find a representative of species i from a set of N 
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possible species. Assume, in an experiment we find M x 
copies of species 1, M 2 copies of species 2 ... Mjy. copies 

of species N*, with X) =i ^ = M and none of the 
remaining species N* + 1, . . . , N. The fact that not all of 
the N species might be observed means N* < N. Stated 
differently, the number N is not directly accessible by 
the experiment but has to be estimated using probabilistic 
arguments. However, beyond all methods of optimally 
estimating the unknown N - which then completely 
specifies the assumed equidistribution - the question arises 
how probable the hypothesis of an equidistribution is at all. 

Clearly, the law of large numbers asserts the stochastic 
convergence of relative frequencies towards the related 
probabilities, i.e. 



Pi 



lim 



fi ) 



f = ^ 



(i) 



where the limit is approached for almost all (in the 
mathematical sense) experimental realizations (sampling 
probes). For cases where the condition M S> N is not 
fulfilled, however, the relative frequencies often deviate 
considerably from the related probabilities, e.g. Schmi tt 
et al. (1993); |Herzel et al. (1994| ); [Poschel et al. (1995| ). 
Strictly speaking, for finite M one cannot even be sure to 
have found each species at least once. Just imagine one or 
a few species with probabilities being orders of magnitude 
smaller than the overwhelming rest of nearly identically 
probable species. One might say that in such a situation all 
the tiny probability events are dispensable for an efficient 
description and a uniform distribution is true rather in a 
practical sense. However, in other situations deviations 
of an underlying probability distribution from a true 
equidistribution can be that significant that the hypothesis 
of a uniform probability distribution should be judged as 
inappropriate. As an example consider a triangle shaped 
distribution. Now, how does this substantial discrepancy 
show up in experimental sampling probes and how can this 
be distinguished from pure finite sample size effects? 

To illustrate typical distortions of the equidistribution 
due to finite sample size we depict Fig. [I] (left) which 
shows a frequency distribution obtained by drawing M = 
10 4 equidistributed random integers from the interval 
[1, 2, . . . 1000], i.e., N = 1000. From the probabilities 
Pi = P2 = ---Pn = 10~ 3 we expect to find each 
of the numbers, on average, M/N times, i.e., (fx) = 
(f 2 ) = . . . = (/jv) = 10- Figure |l| shows that there 
are large fluctuations of the occurrences of the numbers. 
A more convenient way to represent these data is the 
rank ordered frequency distribution. To this end we re- 
order the abscissa in a way to receive a decaying curve of 
frequencies, i.e., the most frequent species occupies rank 
1, the most frequent but one occupies rank 2 etc. The rank 
ordered frequencies are depicted in the right part of Fig. [TJ 
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Fig. 1. The measured frequency distribution may deviate from the 
probability distribution considerably due to finite sample effects. 
The left figure shows absolute frequencies which have been 
obtained by drawing M = 10 4 random integers from the interval 
1 . . . N, N = 1000. The right figure presents the same data but in 
rank order (see text). Whereas the expectation value for each of the 
numbers is M/N — 10 (dashed line) large deviations are 
noticeable. 



Since our random numbers are equidistributed by con- 
struction the expectation value for each of the numbers 
(ranks) is M/N = 10. As seen from Fig. [j] fluctuations 
around this mean are considerable: if one naively inferred 
the probabilities (or concentrations) from the relative fre- 
quencies, one would end up with a relative error of 1 10% 
for small ranks and 90% for large ranks. Being faced with 
a measurement as the one sketched in Fig. [j] it is far from 
trivial to decide whether the objects (in our case random 
numbers) are equidistributed. It is the aim of this paper 
to propose a method which allows to distinguish between 
uniform and non-uniform probability distributions if the 
sample size is too small to identify the probabilities with 
the relative frequencies due to Eq. ([j]). 

Finite size statistics 

For equidistributed events j G [1 . . . N] with Pj = p = 
1/N the probability to find with M trials a number of 
at least ki different events each occurring e xactly i times 
with i = 0, . . . , M reads (|von Mises, 1939|) 



P(k u i) 



Ml 



(i\) k *(M-kii)\ 



(M-kii) 



(2) 

Applying t he Exclusion-Inclusion-Principle (J ohnson and 
Kotz, 1977) to Eq. (|2|) one can derive the probability to find 
with M trials exactly k t different species each occurring 
exactly i times: 



p(ki,i) 



Ml 
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where \ x\ stands for the integer of x. The first moment of 
this probability reads 



^ N "-" d - ^ 



(M-i) 



(4) 



Hence, when randomly drawing M representatives each 
occurring with the same probability p x = p 2 = 
. . . Pn, one expects to find (K ) species zero-times, (K\) 
once, . . . (K M ) species M times. The full derivation of 
Equations (p|) and using rather involv ed algebra, can 
be found in |Freund and Poschel (1995 ); Poschel and 
Freund(1997). 

For our above example, if M random integers have 
been drawn from the interval 1, . . . , N, Eq. (Q) describes 
how many random numbers, on average, occur exactly 
zero times, i.e., never ((K )), once ((if x )), twice {(K 2 )), 
M times ((K M )). Hence, using Eq. (|j), for an 
equidistribution it is possible to calculate the e xpected 
measured frequencies ( |Poschel and Freund, 2002 ). 

In converse direction, it is possible to infer for each i = 
1 . . . M a value from the experimentally observed 
ki by identifying fej = (Ki) (i = 1 . . . M) and making 
use of Eq. (Q). Thus, sampling a true equidistribution, one 
should find 



JV (1) ss iV (2) 



N . 



(5) 



In theory, the distribution (||) exists for i = 1 . . . M where 
the event related to k M corresponds to the extreme case 
that all M representatives belong to the same species. 
In measurements not all k i can be different from zero. 
Therefore, the approximative equation (^|) is valid for all 
upper indices (L) for which the corresponding fcj, / 
has been found in the measurement. 
The estimated values read for % = 1: 



M 



(6) 



and for all other occupation numbers (K. j) 



M\ 1 



M-i 



(7) 



Equation ^ has to be solved numerically by an iteration 
procedure. As discussed below in dependence on the 
variables M, (Ki), and i this equation may have zero, one, 
or two solutions and one has to select the appropriate one. 

To clarify the meaning of the approximate identity signs 
in Eq. (||) we point out that the identification ki = (K^) is 
an approximation which should be amended by statistical 
fluctuations, i.e., ki = (Ki) + (AK { ), with (AK t ) ~ 



\/va.rKi. The variance of Ki for an equidistribution can 
be achieved analytically using the generating function of 
p (ki, i) (see Eq. (||)). It reads (Poschel and Freund, 2002) 



varif. 



K 



(Ki 



1 + 



(K 
M- 



(8) 



(N-2) 



M-2i 



(N-l) 



M- 



(K 



This variance of the Ki can be converted into a charac- 
teristic error interval around the derived Ni simply by ap- 
plying Eq. (Q) not only to Ki but also to K t + AKi and 
Ki - AKi. 

This means, if for each i we plot the set of experi- 
mentally determined numbers N^' together with their 
expected range of fluctuations (±\/variVj)), for an un- 
derlying equidistribution we should be able to draw a 
straight line which passes through all the error intervals. 
On the contrary, if any horizontal line significantly falls 
outside at least one of the intervals the hypothesis of an 
underlying equidistribution should be rejected. 

RESULTS 

We want to illustrate the method by means of an equidis- 
tribution 

Pl = 1/100, f = l,...,100 (9) 
and a triangular distribution 



101 ) 



1,...,100. (10) 



As discussed above, when drawing random events 
according to these probability distributions, the resulting 
rank-ordered frequency distributions will significantly 
depend on the sample size M. Figure || shows the rank- 
ordered frequency distributions for different values of the 
sample size M. Whereas the top row figures (M = 10 4 ) 
clearly reflect the equidistribution (left) and the triangular 
distribution (right), respectively, for smaller sample 
sizes (lower rows) one notices, as expected, significant 
deviations between probabilities and frequencies. The 
central issue is now whether from these small sample 
rank-ordered frequency distributions (lower rows) one 
can nevertheless distinguish between the uniform and the 
triangular probability distribution. 

For all plots drawn in Fig. ^| we calculated the estimated 
total number of events N^' from the observed occupation 
numbers k t for diverse cluster sizes i. The result is shown 
in Fig. |[ As expected, only for the equidistributed species 
(left plots) we find that the approximate constancy of N^ 1 ' 
as expressed by Eq. (||) holds true. Surprisingly, even for 
quite small sample size M = 100, i.e., very poor statis- 
tics, we can clearly distinguish between the frequency dis- 
tributions originating from the equidistribution Eq. (0) and 
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Fig. 2. Frequencies generated from the equidistributed 
probabilities, Eq. (Bf) (left plots) and from the triangular probability 
distribution Eq. ( |lQ| ) for different sample sizes. From top to bottom: 
M = 10 4 , M = 10 3 , M = 500, M = 100. The dashed lines show 
the underlying probability distributions, Eq. (^) (left figures) and 
Eq. (Juj), respectively. 



Figure [| shows the mean 



M 



i=l 



(11) 



Fig. 3. Estimated total number of species due to Eqs. ^ and (Q) for 
the data shown in Fig. ^. For equidistributed species (left figures) 
we find the condition Eq. (^) fulfilled, whereas for triangularly 
distributed species different occupation numbers ki yield 
significantly different estimates N^'K From top to bottom: 
M = 10 4 , M = 10 3 , M = 500, M = 100. The dashed lines 
display the correct value N = 100. 



of the estimated total number of species and the standard 
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deviation 



M 

i=\ 



/ M 



\ 



V 



(12) 



for the equidistribution over the sample size M. The 
summation in Eqs. ( |i~ij ) and (12) runs over all indices i G 
[1, M] which correspond to occupation numbers ki ^ 0. 
The symbol # (ki) stands for the number of these non- 
empty occupations. The larger the sample size the smaller 
the standard deviation. Even for rather small sample sizes 
M ~ 150 the estimated total number of species agrees 
well with the true value N = 100 as shown in the zoomed 
region (lower plot in Fig. Q). The dashed lines show the 
true total number of species, N = 100. 
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Fig. 4. The mean estimated total number of species (circles) and 
the standard deviation s (boxes) due to Eqs. ( |ll| ) and ( (T^ ) for 
species distributed according to the equidistribution Eq. ([]). The 
dashed lines show the true total number of species N = 100. 

Figure || depicts the corresponding data for species 
which were generated from the triangular probability 
distribution Eq. ([To[). For larger sample sizes M we 



obtain a growing mean of the estimated total number 
of species. With increasing sample size the standard 
deviation increases too. 
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2000 

sample size M 

Fig. 5. The mean estimated total number of species (circles) and 
the standard deviation s (boxes) due to Eqs. ( ll]) and ( |l2| ) for 
species distributed according to the triangular distribution Eq. (|ic|). 



SOLUTIONS OF EQ. (7) 

From Eq. (Q) it follows that for a given sample size M 
the curve (Ki), now understood as a function of the total 
number of species N, has a maximum. As an example, 
in Fig. | we plotted (K 20 ) vs. N (for M = 1000). The 
curve has a maximal value (i^2o) maX ~4.6atiV~52.6, 
meaning that there is no uniform probability distribution 
which is in agreement with any larger experimentally 
determined values of fc 2 o- 




40 60 80 100 

number of species N 

Fig. 6. {K20) over the total number of species N for M = 1000. 
The curve has a maximum (A'2o) max ~ 4.6 at N ~ 52.6. 



The extremum of (Ki) for i > 2 can be found from Eq. 
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which occurs for 



(1 - Mf- M {i - M) M_i (l - i) 1 " 1 

(13) 



JV 1: 



M — 1 
i-1 



(14) 



Hence, Eq. (Q) has no solution for any ki > (Ki) max . 
Figure |(| shows also that for any ki < (Ki) ma * there are 
two solutions of Eq. (0): for M = 1000 the value k 20 = 2 
has the solutions near Ni = 40 and N 2 = 71. The reason 
for this behaviour becomes clear when drawing (Ki) over 
i for Ni = 40 and N 2 = 71 (Fig. 0). The curves 
intersect at i = 20 in agreement with the two solutions 
of Eq. ([7]). In the same way each value of k t < (Ki) max 
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Fig. 7. (Ki) over i due to Eq. (g) for M = 1000 and JVi = 40 (full 
line) and N — 71 (dashed line), respectively. The lines intersect at 
i = 20. 

corresponds to two such solutions, TV^ < N 2 \ Hence, 
for a construction of the plots in Fig. || one always has 
to choose the appropriate branch. If we had only one 
measurement ki for a particular i there were no means to 
figure out whether the corresponding N is N% or N 2 . In 
our case, however, there is a series of ki for a certain range 
of i and all of them originate from the same measurement. 

Obviously, the size ki of the largest i which is found 
in the data corresponds to the right branch of the curve 
(Ki) (i) as drawn in Fig. 0, i.e., to N 2 \ If it would 
correspond to the left branch, larger i should still be 
populated. Contrary, the smallest i correspond to N[ . 
From the single humped- shape of the curve (Ki) (i) we 
conclude that there is only one transition from N x to -/V 2 , 
i.e., the solution is 



2 < i* 



Jvi 4) for 
nY 1 for i > i* 



(15) 



where i* has to be determined. For an equidistribution we 
expect that for all i approximate the true N (see Eq. 



(|5|)). Therefore, we choose i* from [i 
minimises 



mm 5 "max 



such that it 



(AN) 



mm . 



(16) 



with 



E £ N± 

— imin l—i* +1 



JVC*") 



E (N. 



(JVC**) 



min 1 

t=i*+l v 



+ 1 



(17) 



(18) 



and with i min and i max being the sizes of the smallest and 
largest clusters which are found in the data set. 

The estimated number N which is drawn in Figs. |3[ [4[ 
and U is, hence, given by Eq. (|l5|) with the condition (|16[). 
Figure d8|) drafts the described procedure. 




Fig. 8. Top: Both solutions of Eq. ([]), N 1 (i) and N 2 (i) for the 
equidistribution M = 500 and N = 100, corresponding to Fig. ^ 
(3rd row, left column). The choice i* = 7 minimises s (see bottom 
figure) and yields TV ~ 96.52. 



DISCUSSION 

Given a sampling probe of size M originating from an 
unknown probability distribution, it may be important to 
decide whether the data set is compatible with an equidis- 
tribution with known or unknown total number of species 
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N. Due to finite sample size effects, experimentally acces- 
sible frequency distributions will always experience de- 
viations from the underlying probability distribution - be 
it uniform or non-uniform. Rank-ordering the frequency 
distribution, being a first step towards a systematic sur- 
vey, still does not help for rather small sample sizes. The 
challenge is to find a criterion, solely based on the fre- 
quency distribution (data set), which allows to accept or 
reject the hypothesis of a uniform probability distribution 
and, in case of acceptance, to render the number N within 
statistical errors. 

We developed a method based on an analytic expression 
for the average number of events (Ki) occurring i times 
(the varying lengths of the plateaus in the rank-ordered 
frequency distributions) which involves the sample size 
M and the a priori (un)known number N. By inversion 
of this relation it is possible to compute for each i an es- 
timate N' 1 ' for the hypothetical N, completely specify- 
ing the assumed equidistribution. For true uniform prob- 
ability distributions these estimates should vary slightly, 
i.e. within expected statistical errors, when varying i. Sub- 
stantial variations of the estimates, i.e. beyond expected 
statistical errors, or trends are a clear signature of a non- 
uniform distribution. 

We exemplified this method by applying it to a uniform 
and a triangular distribution. The separation between 
both by virtue of our criterion is possible down to 
small sample sizes for which visual inspection of rank- 
ordered frequency distributions or standard tests such as 
X 2 and Kolmogorov-Smirnov do not allow for a clear-cut 
distinction. 
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